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Abstract 

In a variety of disciplines such as social sci- 
ences, psychology, medicine and economics, the 
recorded data are considered to be noisy mea- 
surements of latent variables connected by some 
causal structure. This corresponds to a fam- 
ily of graphical models known as the structural 
equation model with latent variables. While 
linear non-Gaussian variants have been well- 
studied, inference in nonparametric structural 
equation models is still underdeveloped. We in- 
troduce a sparse Gaussian process parameteriza- 
tion that defines a non-linear structure connect- 
ing latent variables, unlike common formulations 
of Gaussian process latent variable models. The 
sparse parameterization is given a full Bayesian 
treatment without compromising Markov chain 
Monte Carlo efficiency. We compare the stabil- 
ity of the sampling procedure and the predictive 
ability of the model against the current practice. 



1 CONTRIBUTION 

A cornerstone principle of many disciplines is that obser- 
vations are noisy measurements of hidden variables of in- 
terest. This is particularly prominent in fields such as so- 
cial sciences, psychology, marketing and medicine. For 
instance, data can come in the form of social and eco- 
nomical indicators, answers to questionnaires in a medical 
exam or marketing survey, and instrument readings such as 
fMRI scans. Such indicators are treated as measures of la- 
tent factors such as the latent ability levels of a subject in 
a psychological study, or the abstract level of democrati- 
zation of a coun try. The literature on structural equation 
models (SEMs) jBartholomew et all 120081; iBollenl 1 19891) 
approaches such problems with directed graphical models, 
where each node in the graph is a noisy function of its par- 
ents. The goals of the analysis include typical applications 
of latent variable models, such as projecting points in a la- 



tent space (with confidence regions) for ranking, cluster 
ing and visualization; density e stimation; missing data im 
putati on; and causal inference (Pearl, 2000; ISpirtes et al. 
20001) . 



This paper introduces a nonparametric formulation of 
SEMs with hidden nodes, where functions connecting la- 
tent variables are given a Gaussian process prior. An effi- 
cient but flexible sparse formulation is adopted. To the best 
of our knowledge, our contribution is the first full Gaussian 
process treatment of SEMs with latent variables. 

We assume that the model graphical structure is given. 
Structural model selection with latent variables is a com- 
plex topic which we will not pursue here: a detailed 
discussion of model selection is left as fu t ure w ork. 
Asparouhov and Muthlnl d2009l) and lSilva et al. (2006J) dis- 
cuss relevant issues. Our goal is to be able to generate 
posterior distributions over parameters and latent variables 
with scalable sampling procedures with good mixing prop- 
erties, while being competitive against non-sparse Gaus- 
sian process models. 

In Section |2] we specify the likelihood function for our 
structural equation models and its implications. In Sec- 
tion |2 we elaborate on priors, Bayesian learning, and a 
sparse variation of the basic model which is able to handle 
larger datasets. Section|4]describes a Markov chain Monte 
Carlo (MCMC) procedure. Section [5] evaluates the useful- 
ness of the model and the stability of the sampler in a set of 
real-world SEM applications with comparisons to modern 
alternatives. Finally, in Section[6]we discuss related work. 



2 THE MODEL: LIKELIHOOD 

Let Q be a given directed acyclic graph (DAG). For sim- 
plicity, in this paper we assume that no observed variable 
is a parent in Q of any l atent variable. Many SEM appli - 



cations are of this type (Bollen, 1989t ISilva et all |2006) 



and this will simplify our presentation. Likewise, we will 
treat models for continuous variables only. Although cyclic 
SEMs are also well-defined for the linear case (IBollenl 
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Figure 1 : (a) An example adapted from IPalomo et al, latent variable IL corresponds to a scalar labeled as the 

industrialization level of a country. PDL is the corresponding political democratization level. Variables Yi, Y2, Y3 are 
indicators of industrialization (e.g., gross national product) while Y4, . . . , I7 are indicators of democratization (e.g., expert 
assessements of freedom of press). Each variable is a function of its parents with a corresponding additive error term: q for 
each Yi, and £ for democratization levels. For instance, PDL — f(IL) + £ for some function /(•). (b) Dependence among 
latent variables is essential to obtain sparsity in the measurement structure. Here we depict how the graphical dependence 
structure would look like if we regressed the observed variables on the independent latent variables of (a). 



1989), non-linear cyclic models are not trivial to define and 
as such we will exclude them from this paper. 

Let X be our set of latent variables and Xj £ X be a partic- 
ular latent variable. Let Xp ; be the set of parents of Xi in 
Q. The latent structure in our SEM is given by the follow- 
ing generative model: if the parent set of Xi is not empty, 



statistics (e.g., IWood et al.1 d2006h : IZou et alj d2006l) ). not 
much has been done on exploring nonparametric models 
with dependent latent structure (a loosely related excep- 
tion being dynamic systems, where filtering is the typical 
application). Figure |TJb) illustrates how modeling can be 
affected by discarding the structure among latento 



Xi = fi(X Pi ) + 0, where Q ~ JV(0, v Ci ) (1) 2.1 Identiflability Conditions 



JV(m, v) is the Gaussian distribution with mean m and 
variance v. If Xi has no parents (i.e., it is an exogenous 
latent variable, in SEM terminology), it is given a mixture 
of Gaussians marginaQ. 

The measurement model, i.e., the model that describes the 
distribution of observations y given latent variables X, is 
as follows. For each Yj £ y with parent set Xp^ , we have 



y j = \)0 + Xp. K,j + ej , where tj ~ Af(0, v e 



(2) 



Error terms {e^} are assumed to be mutually independent 
and independent of all latent variables in X. Moreover, Aj 
is a vector of linear coefficients Aj — [Xj\ . . . \j\jt P . \V ■ 
Following SEM terminology, we say that Yj is an indicator 
of the latent variables in X p. . 

An ex ample is show n in Figure [Tf a). Following the nota- 
tion of Eolienl (l 989), squares represent observed variables 
and circles, latent variables. SEMs are graphical models 
with an emphasis on sparse models where: 1. latent vari- 
ables are dependent according to a directed graph model; 
2. observed variables measure (i.e., are children of) very 
few latent variables. Although sparse latent variable mod- 
els have been the object of study in machine learning and 



For simplicity of presentation, in this paper we adopt a fi- 
nite mixture of Gaussians marginal for the exogenous variables. 
However, introducing a Dirichlet process mixture of Gaussians 
marginal is conceptually straightforward. 



Latent variable models might be unidentifiable. In the con- 
text of Bayesian inference, this is less of a theoretical issue 
than a computational one: unidentifiable models might lead 
to poor mixing in MCMC, as discussed in Section|5] More- 
over, in many applications, the latent embedding of the data 
points is of interest itself, or the latent regression functions 
are relevant for causal inference purposes. In such appli- 
cations, an unidentifiable model is of limited interest. In 
this Section, we show how to derive sufficient conditions 
for identiflability. 

Consider the case where a latent variable Xi has at least 
three unique indicators ^ = {Yi a , Yip, Y^}, in the sense 
that no element in 3^ has any other pa rent in Q but Xi. It is 
known that in this case (Bollen, 1989) the parameters of the 
structural equations for each element of X are identifiable 
(i.e., the linear coefficients and the error term variance) up 
to a scale and sign of the latent variable. This can be re- 
solved by setting the linear structural equation of (say) Yi a 
to Yi a = Xi + ti a - The distribution of the error terms 
is then identifiable. The distribution of Xi follows from a 



Another consequence of modeling latent dependencies is re- 
ducing the number of parameters of the model: a SEM with a lin- 
ear measurement model can be seen as a type of module network 
dSegal et al.Ll2005f) where the observed children of a particular la- 
tent Xi share the same nonlinearities propagated from Xp ; : in 
the context of Figure Q] each indicator Yi £ {Y4, . . . , Yj} has a 
conditional expected value of A;o + Xnf2(Xi) for a given X%: 
function /iQ is shared among the indicators of X2. 



deconvolution between the observed distribution of an ele- 
ment of and the identified distribution of the error term. 

Identifiability of the joint of X can be resolved by mul- 
tivariat e deconvolut ion under extra assumptions. For in- 
stance, iMasrvl (120031) describes the problem in the context 
of kernel density estimation (with known joint distribution 
of error terms, but unknown joint of y ). 

Assumptions for the identifiability of functions /;(•), given 
the identifiability of the joint of X, have been dis- 
cussed in the li t erature of error-in- variable s regression 
dFan and Truongi 1 19931; ICarroll et all |2004 . Error-in- 



variables regression is a special case of our problem, where 
Xi is observed but Xp ; is not. However, since we have 
Yi a = Xi + €{, this is equivalent to a error-in-variables re- 
gression Yi a = /i(XpJ + £i a + d, where the compound 
error term €i a + Q is still independent of Xp ; . 

It can be shown that such identifiability conditions can be 
exploited in order to identify causal directionality among 
lat ent variables under a dditional assumptions, as discussed 
bv lHover et al. ( 2008a ) for the fully observed cas^l A brief 
discussion is presented in the Appendix. In our context, 
we focus on the implications of identifiabilty on MCMC 
(SectionE). 



size N is represented as {Z' 1 ), . . . , Z^}, where Z is the 
set of all variables. Lower case x represents fixed val- 
ues of latent variables, and x 1:Ar represents a whole set 

{xW xt")}. 

For each Xp i; the corresponding Gaussian process prior for 



function values fl :N = {fj; , • • ■ , fi" 1 } is 



e l:N 



r l:N 



■Af(0,Ki) 



where is a N x TV 

(Ras mussen and Williamsl |2006), as 



xp . Each corresponding x 
in Equation (Q}. 



(d) . , Ad) . Ad) 

' is given by /> ' + Q ' 



kernel matrix 
determined by 

as 



MCMC can be used to sample from the posterior distri- 
bution over latent variables and functions. However, each 
sampling step in this model costs 0(N 3 ), making sampling 
very slow when N is at the order of hundreds, and essen- 
tially undoable when N is in the thousands. As an alter- 
native, we introduce a multilayer ed representation adapted 
from the pseudo-inputs model of ISnelson and Ghahramani 
(2006). The goal is to reduce the sampling cost down to 
0(M 2 N), M < N. M can be chosen according to the 
available computational resources. 



3 THE MODEL: PRIORS 



Each fj(-) can be given a Gaussian process prior 



(Ras mussen and Wi lliams. 2006). In this case, we call this 
class of models the GPSEM-LV family, standing for Gaus- 
sian Process Structural Equation Model with Latent Vari- 
ables. Models without latent var iables and measurement 
model s have been discussed by Friedman and Nachmanl 
(l2000lR 

3.1 Gaussian Process Prior and Notation 

Let Xi be an arbitrary latent variable in the graph, with 
latent parents Xp. . We will use X( d ) to represent the 
d th X sampled from the distribution of random vector X, 



and ■ indexes its i th component. For instance, X 
is the d th sample of the parents of X^, A training set of 



(<0 



3 Notice that if the distribution of the error terms is non- 
Gaussian, identification is easier: we only need two unique in- 
dicators Yi a and Yip: since ticAiP ar, d Xi are mutually in- 
dependent, identification follows from known results derived in 
the literature of ove rcomplete independent component analysis 
jHover et aUl2008bl) . 

4 To see how th e Gaussian process networks of 

Friedman and Nachman (2000) are a special case of GPSEM-LV, 
imagine a model where each latent variable is measured without 
error. That is, each Xi has at least one observed child Yi such that 
Yi = Xi. The measurement model is still linear, but each struc- 
tural equation among latent variables can be equivalently written 
in terms of the observed variables: i.e., Xi = /j(Xp 4 ) + Q is 
equivalent to Yi = /, (Yp i ) + as in Friedman and Nachman. 



3.2 Pseudo-inputs Review 

We briefly review the pseudo-inputs model 
( Snelson and G hahramani. 2006) in our notation. As 
before, let X' d ) represent the d th data point for some X. 



For a set Xj :N = {x\ 1] , . 
parent set Xpf = {X^\ 



jj*'} with corresponding 



X 



p^} and corresponding 



latent function values f/ , we define a pseudo-input set 



-V.M 



{X 



(i) 



,X| M) } such that 



-X 1:N f Tt 1 



1 M 

i 

1:M 



• A'iK, : . v ,;K :W f,. V, 

^(0,K i;M ) 



(3) 



where ~K^mm is a N x M matrix with each (j, k) ele- 
ment given by the kernel function fej(xp. ). Simi- 
larly, K-i-M is a M x M matrix where element (j, k) is 
ki(x.^\x\ k '). It is important to notice that each pseudo- 
input X| d) , d = 1, . . . , M, has the same dimensionality as 
Xp. The motivation for this is that X; works as an alter- 
native training set, with the original prior predictive means 
and variances being recovered if M = N and X; = Xp ; . 

Let kj ; dM be the d th row of K^jvm- Matrix Vj is 
a diagonal matrix with entry Vi-dd given by v^d = 



fej(x 



(<*) Jd) 

Pi ' Pi 



tent function values {f[ X \ 
dependent. 



■ ^J ;dM K i;M^;dM- This implies that all la- 



, } are conditionally in- 



(a) 



(b) 



Figure 2: (a) This figure depicts a model for the latent structure X\ — > X2 with N = 3 (edges into latent functions are 
lighter for visualization purposes only) using a standard Gaussian process prior. Dashed arrows represent that function val- 
ues {Z^ 1 "* , > f-f^ } are mutually dependent even after conditioning on {x[ 1] , x[ 2) , x{ 3) }. In (b), we have the graphical 
depiction of the respective Bayesian pseudo-inputs model with M = 2. Althought the model is seemingly more complex, 
it scales much better: mutual dependencies are confined to the clique of pseudo-functions, which scales by M instead. 



3.3 Pseudo-inputs: A Fully Bayesian Formulation 



The density function implied by © replaces the stan- 
dard Gaussian process prior. In the context of 



Snelson and Ghahramanil ( 20061) . input and output vari- 
ables are observed, and as such Snelson and Ghahramani 
optimize 5cj :M by maximizing the marginal likelihood of 
the model. This is practical but sometimes prone to over- 
fitting, since pseudo-inputs are in fact free parameters, and 
the pseudo-inputs model is best seen as a variation of the 
Gaussian proc ess prior rather than an approximation to it 
dTitsiasL [2009b . 



In our setup, there is limited motivation to optimize the 
pseudo-inputs since the inputs themselves are random vari- 
ables. For instance, we show in the next section that the 
cost of sampling pseudo-inputs is no greater than the cost 
of sampling latent variables, while avoiding cumbersome 
optimization techniques to choose pseudo-input values. In- 
stead we put a prior on the pseudo-inputs and extend the 
sampling procedure. By conditioning on the data, a good 
placement for the pseudo-inputs can be learned, since Xp ; 
and x| d ' are dependent in the posterior. This is illustrated 
by Figure [2] Moreover, it naturally provides a protection 
against overfitting. 

A simple choice of priors for pseudo-inputs is as fol- 
lows: each pseudo-input , d — 1, . . . , M, is given 
a J\f([if, Sf) prior, independent of all other random vari- 
ables. A partially informative (empirical) prior can be eas- 
ily defined in the case where, for each X^, we have the 
freedom of choosing a particular indicator Y q with fixed 
structural equation Y q = X k + e q (see Section |2~TT i, imply- 
ing -Ef-Xfe] = This means if X^ is a parent X i7 we 
set the respective entry in /if (recall /j,f is a vector with an 
entry for every parent of Xi) to the empirical mean of Y q . 
Each prior covariance matrix 'Ef is set to be diagonal with 
a common variance. 

Alternatively, we would like to spread the pseudo-inputs a 



priori: other things being equal, pseudo-inputs that are too 
close to each can be wasteful given their limited number. 
One prior, inspired by sp ace-filling designs fr om the exper- 
imental design literature ( iSantner et all 12003 ), is 



p(x l i:M )«det(D l ) 

the determinant of a kernel matrix D^. We use a 
squared exponentia l covariance function with charac teristic 



length scale of 0.1 (IRasmussen and Williams , 2006), and a 
"nugget" constant that adds 10~ 4 to each diagonal term. 
This prior has support over a [-L, L]l Xp ; hypercube. We 
set L to be three times the largest standard deviation of ob- 
served variables in the training data. This is the pseudo- 
input prior we adopt in our experiments, where we center 
all observed variables at their empirical means. 

3.4 Other Priors 

We adopt standard priors for the parametric components of 
this model: independent Gaussians for each coefficient Ay, 
inverse gamma priors for the variances of the error terms 
and a Dirichlet prior for the distribution of the mixture in- 
dicators of the exogenous variables. 

4 INFERENCE 

We use a Metropolis-Hastings scheme to sample from our 
space of latent variables and parameters. Similarly to Gibbs 
sampling, we sample blocks of random variables while 
conditioning on the remaining variables. When the corre- 
sponding conditional distributions are canonical, we sam- 
ple directly from them. Otherwise, we use mostly standard 
random walk proposals. 

Conditioned on the latent variables, sampling the parame- 
ters of the measurement model is identical to the case of 
classical Bayesian linear regression. The same can be said 
of the sampling scheme for the posterior variances of each 
fi. Sampling the mixture distribution parameters for the ex- 
ogenous variables is also identical to the standard Bayesian 



case of Gaussian mixture models. Details are described in 
the Appendix. 

We describe the central stages of the sampler for the sparse 
model. The sampler for the model with full Gaussian pro- 
cess priors is simpler and analogous, and also described in 
the Appendix. 

4.1 Sampling Latent Functions 

In principle, one can analytically marginalize the pseudo- 
functions fy . However, keeping an explicit sample of 
the pseudo-functions is advantageous when sampling la- 
tent variables x[ d \ d = 1, . . . , N: for each child X c of 
Xi, only the corresponding factor for the conditional den- 
sity of fj^ needs to be computed (at a O(M) cost), since 
function values are independent given latent parents and 
pseudo-function s. This issue does not arise in the fully- 
observed case of ISnelson and Ghahramanil (120061) . who do 
marginalize the pseudo-functions. 

Pseudo-functions and functions {f^ : , f^ : } are jointly 
Gaussian given all other random variables and data. The 
conditional distribution of fl'- M given everything, except 



itself and {/, 
matrix 



(i) 



f\ N> }, is Gaussian with covariance 



-I/i^JK^atmKj. 



M) 



where V,; is defined in Section 13.21 and I is a M X M 
identity matrix. The total cost of computing this matrix 
is 0(NM 2 + M 3 ) = 0(NM 2 ). The corresponding mean 
is 



Si x K: ,K 



(V^+I/^Jx 



l-.N 



where xj : N is a column vector of length N. 

Given that fP M is sampled according to this multivariate 
Gaussian, we can now sample {f^, . . . , } in parallel, 
since this becomes a mutually independent set with univari- 
ate Gaussian marginals. The conditional variance of is 
v[ = 1/(1/^^(2 + l/^C;)' where Vi-dd is defined in Section 



3.21 The corresponding mean is v[{f^ /v^dd 



M) 



where fff^ 



In Section|5] we also sample from the posterior distribution 
of the hyperparameters of the kernel function used by 
fZ-i-M and K^-jvAf- Plain Metropolis-Hastings is used to 
sample these hyperparameters, using an uniform proposal 

in [a&i, (l/a)0i] for < a < 1. 

4.2 Sampling Pseudo-inputs and Latent Variables 

We sample each pseudo-input xj one at a time, d = 
1,2,..., M, Recall that x^ is a vector, with as many 
entries as the number of parents of X^. In our implementa- 
tion, we propose all entries of the new x ^ simultaneously 



using a Gaussian random walk proposal centered at 5t^ 
with the same variance in each dimension and no correla- 
tion structure. For problems where the number of parents 
of Xi is larger than in our examples (i.e., four or more par- 
ents), other proposals might be justified. 

Let 7fi (xj ) be the conditional prior for given 
x,[ V) , where (\d) = {1, 2, . . . , d-1, d+1, M}. Given 
a proposed ' , we accept the new value with probability 
minjl^xf ')A(xf)} where 



1 ld=l u i-dd e 



(d) 



f(\d) - 

:dM K. .,f, 



2 /(2Vi;dd) 



and p(f 



(d) | r(\d) 



Xi) is the conditional density that fol- 



lows from Equation (01. Row vector k^Af is the d th row 
of matrix K^.^m- Fast submatrix updates of and 
KjjjvAfK^^ are required in order to calculate Zj(-) at a 
O(N M) cost , which can be done by standard Cholesky up- 
dates dSeegeHHooi)- The total cost is therefore 0(NM 2 ) 
for a full sweep over all pseudo-inputs. 



f£ , is known to be 



The conditional density p(f l 
sharply pe aked for moderate s izes of M (at the order of 
hundreds) dTitsias et al ., 2009), which may cause mixing 
problems for the Markov chain. One way to mitigate this 
effect is to also propose a value f- d ^ jointly with x^ , 
which is possible at no additional cost. We propose the 
pseudo-function using the conditional p{fl \ fj; , Xj). 
The Metropolis-Hastings acceptance probability for this 

variation is then simplified to min |l, if(x^ )/Zj(x| rf -')|, 
where 



Z°(xi d) ) 



r_- (x, ; 



11(2=1 u i:dd c 



Finally, consider the proposal for latent variables x[ d \ For 
each latent variable Xi, the set of latent variable instan- 
tiations {X^\ x[ 2 \ . . . ,X> N '} is mutually independent 
given the remaining variables. We propose each new la- 
tent variable value x+ in parallel, and accept or reject it 
based on a Gaussian random walk proposal centered at the 
current value xf\ We accept the move with probability 
min |l, hxi{xf^ )/hxi{%f^)^ where, if Xj is not an ex- 
ogenous variable in the graph, 

hxM d) ) 



e -(-< d) -/i d, ) 2 /(2^) 



p(y { c d) I x£>) 



c6X c , 

x Lly.eYc, 



where is the set of latent children of Xi in the graph, 
and Yd is the corresponding set of observed children. 



The conditional p(fc | fc, x c , x\ a> ), which follows from 
(01, is a non-linear function of x[ d \ but crucially does not 
depend on any x\'' variable except point d. The evaluation 
of this factor costs 0(M 2 ). As such, sampling all latent 
values for X t takes 0(NM 2 ). 

The case where Xi is an exogenous variable is analogous, 
given that we also sample the mixture component indica- 
tors of such variables. 



5 EXPERIMENTS 

In this evaluation Section^, we briefly illustrate the algo- 
rithm in a synthetic study, followed by an empirical eval- 
uation on how identifiability matters in order to obtain an 
interpretable distribution of latent variables. We end this 
section with a study comparing the performance our model 
in predictive tasks against common alternative^. 

5.1 An Illustrative Synthetic Study 

We generated data from a model of two latent variables 
(X 1 ,X 2 ) where X 2 = 4X 2 + £a, *S = X 1 + a far 
i = 1,2,3 and Y t = X 2 + e it for i = 4,5, 6. X x and all 
error terms follow standard Gaussians. Given a sample of 
150 points from this model, we set the structural equations 
for Y\ and I4 to have a zero intercept and unit slope for 
identifiability purposes. Observed data for Y\ against I4 is 
shown in Figure |3j a), which suggests a noisy quadratic re- 
lationship (plotted in|3jb), but unknown to the model). We 
run a GPSEM-LV model with 50 pseudo-inputs. The ex- 
pected posterior value of each latent pair {x[ d \ } for 
d = 1, . . . , 150 is plotted in Figure EJc). It is clear that we 
were able to reproduce the original non-linear functional 
relationship given noisy data using a pseudo-inputs model. 



5 MATLAB code to run all of our experiments is available 
at |http : / /www . homepages ■ ucl ■ ac ■ uk/~ucgtrbd/| 
code/gpsem. zip 

6 Some implementation details: we used the squared expo- 
nential kernel function fc(x p ,x 9 ) = aexp( — ^|x p — x g | 2 ) + 
10~ 4 S pq , where S pq — 1 is p = q and otherwise. The hyper- 
prior for a is a mixture of a gamma (1, 20) and a gamma (10, 10) 
with equal probability each. The same (independent) prior is 
given to b. Variance parameters were given inverse gamma (2, 
1) priors, and the linear coefficients were given Gaussian priors 
with a common large variance of 5. Exogenous latent variables 
were modeled as a mixture of five Gaussians where the mixture 
distribution is given a Dirichlet prior with parameter 10. Finally, 
for each latent Xi variable we choose one of its indicators Yj and 
fix the corresponding edge coefficient to 1 and intercept to to 
make the model identifiable. We perform 20, 000 MCMC iter- 
ations with a burn-in period of 2000 (only 6000 iterations with 
1000 of burn-in for the non-sparse GPSEM-LV due to its high 
computational cost). Small variations in the priors for coefficients 
(using a variance of 10) and variance parameters (using an inverse 
gamma (2, 2)), and a mixture of 3 Gaussians instead of 5, were 
attempted with no significant differences between models. 



For comparison, the outp ut of the Gaussian process latent 
variable model (GPLVM, lLawrencei 120051) with two hid- 
den variables is shown in Figure (3d). GPLVM here as- 
sumes that the marginal distribution of each latent variable 
is a standard Gaussian, but the measurement model is non- 
parametric. In theory, GPLVM is as flexible as GPSEM- 
LV in terms of representing observed joints. However, it 
does not learn functional relationships among latent vari- 
ables, which is often of central interest in SEM applications 
dBollenl \l9S9). Moreover, since no marginal dependence 
among latent variables is allowed, the model adapts itself 
to find (unidentifiable) functional relationships between the 
exogenous latent variables of the true model and the ob- 
servables, analogous to the case illustrated by Figure [Tib). 
As a result, despite GPLVM being able to depict, as ex- 
pected, some quadratic relationship (up to a rotation), it is 
noisier than the one given by GPSEM-LV. 

5.2 MCMC and Identifiability 

We now explore the effect of enforcing identifiability con- 
straints on the MCMC procedure. We consider the dataset 
Consumer, a stud>Hwifh 3 33 university students in Greece 



(Bartholo mew et all 120081) . The aim of the study was to 



identify the factors that affect willingness to pay more to 
consume environmentally friendly products. We selected 
16 indicators of environmental beliefs and attitudes, mea- 
suring a total of 4 hidden variables. For simplicity, we will 
call these variables X\, . . . , X4, The structure among la- 
tents is Xi ->• X 2 , X ->■ X a , X 2 ->■ X a , X 2 -> X 4 . Full 
details are given by Bar tholomew et al. I (l2008l) . 



All observed variables have a single latent parent in the cor- 
responding DAG. As discussed in Section 12.11 the corre- 
sponding measurement model is identifiable by fixing the 
structural equation for one indicator of each variable to 



have a zero intercept and unit slope (Bar tholomew et al 
2008). If the assumptions described in the references of 



Section lZTI hold, then the latent functions are also identifi- 
able. We normalized the dataset before running the MCMC 
inference algorithm. 

An evaluation of the MCMC procedure is done by running 
and comparing 5 indepen dent chains , each starting from a 
different point. Following IL"eel(l2007l). we evaluate conver- 
gence using the EPSR statistic dGelman and Rubini Il992l) . 
which compares the variability of a given marginal pos- 
terior within each chain and between chains. We calcu- 
late this statistic for all latent variables {X%, X 2 , X3, X4} 
across all 333 data points. 

A comparison is done against a variant of the model where 
the measurement model is not sparse: instead, each ob- 
served variable has all latent variables as parents, and no 



7 There was one latent variable marginally independent of ev- 
erything else. We eliminated it and its two indicators, as well as 
the REC latent variable that had only 1 indicator. 




Figure 3: (a) Plot of observed variables Y± and Y4 generated by adding standard Gaussian noise to two latent variables Xi 
and X2, where X2 = <±X\ + £2, (2 also a standard Gaussian. 150 data points were generated, (b) Plot of the corresponding 
latent variables, which are not recorded in the data, (c) The posterior expected values of the 150 latent variable pairs 
according to GPSEM-LV. (d) The posterior modes of the 150 pairs according to GPLVM. 



5 independent chains, sparse model 5 chains, unidentifiable model 5 independent chains, sparse model 5 chains, unidentifiable model 




Iteration Iteration Iteration Iteration 



Figure 4: An illustration of the be havior of independent cha ins for X^ 10 ^ and X^'"' using two models for the Consumer 
data: the original (sparse) model jBartholomew et aL , 20081) : an (unidentifiable) alternative where the each observed vari- 
able is an indicator of all latent variables. In the unidentifiable model, there is no clear pattern across the independent 
chains. Our model is robust to initialization, while the alternative unidentifiable approach cannot be easily interpreted. 



,(200) 



coefficients are fixed. The differences are noticeable and 
illustrated in Figure |4] Box-plots of EPSR for the 4 latent 
variables are shown in Figure [5] It is difficult to interpret 
or trust an embedding that is strongly dependent on the ini- 
tialization procedure, a s it is the case for the unidentifiable 
model. As discussed bv lPalomo et al. (2007), identifiability 
might not be a fundamental issue for Bayesian inference, 
but it is an important practical issue in SEMs. 

5.3 Predictive Verification of the Sparse Modei 

We evaluate how well the sparse GPSEM-LV model 
performs compared against two parametric SEMs and 
GPLVM. The linear structural equation model is the SEM, 
where each latent variable is given by a linear combina- 
tion of its parents with additive Gaussian noise. Latent 
variables without parents are given the same mixture of 
Gaussians model as our GPSEM-LV implementation. The 
quadratic model includes all quadratic and linear terms, 
plus first-order interactions, among the parents of any given 
latent variable. This is per haps the most common non- 
linear SEM used in practice (IBollen and Paxtorill998tlLee , 
20071) . GPLVM is fit with 50 active points a nd the rbf 
kernel with automatic relevance determination (ILawrence 



2005). Each sparse GPSEM model uses 50 pseudo-points. 



We performed a 5-fold cross-validation study where the av- 
erage predictive log-likelihood on the respective test sets is 
reported. Three datasets are used. The first is the Con- 
sumer dataset, described in the previous section. 



The second is the Abalone data (Asuncion and Newman, 



2007), where we postulate two latent variables, "Size" and 
"Weight." Size has as indicators the length, diameter and 
height of each abalone specimen, while Weight has as indi- 
cators the four weight variables. We direct the relationship 
among latent variables as Size Weight. 



The t hird is the Housing dataset (Asu ncion and Newman , 
2007[ lHarrison and RubinfekJ Il978l) . which includes in- 
dicators about features of suburbs in Boston that are rel- 
evant for the housing market . Fo llowing the original 



study (lHarrison and Rubinfeldl 119781 Table IV), we pos- 
tulate three latent variables: "Structural," corresponding 
to the structure of each residence; "Neighborhood," corre- 
sponding to an index of neighborhood attractiveness; and 
"Accessibility," corresponding to an index of accessibil- 
ity within Bostorfl The corresponding 1 1 non-binary ob- 



8 The analysis by ( Harrison and Rubinfeld, 1978, Table IV) 



EPSR distribution, consumer data 



Abalone: Predictive Latent Variables 



Housing: Predictive Latent Variables 



Figure 5: Boxplots for the EPSR distribution across each of 
the 333 datapoints of each latent variable. Boxes represent 
the distribution for the non-sparse model. A value less than 
1.1 is considered acceptable evidence of convergence (Lee, 
2007), but this essentially never happens. For the sparse 
model, all EPSR statistics were under 1.03. 



served variables that are associated with the given latent 
concepts are used as indicators. The "Neighborhood" con- 
cept was refined into two, "Neighborhood I" and "Neigh- 
borhood II" due to the fact that three of its original in- 
dicators have very similar (and highly skewed) marginal 
distributions, which were very dissimilar from the oth- 
er|| The structure among latent variables is given by a 
fully connected network directed according to the order 
{Acc essibility, Structural, Neighbo rhood II, Neighborhood 
I}. lHarrison and Rubinfeld ( 1978 ) provide full details on 
the meaning of the indicators. We note that it is well known 
that the Housing dataset poses stability problems to density 
estimation due to discontinuities i n the variable RAD, one 
of the indicators of accessibility ( Friedman and Nachmanl 



2000). In order to get more stable results, we use a subset 



of the data (374 points) where RAD < 24. 

The need for non-linear SEMs is well-illustrated by Figure 
[6] where fantasy samples of latent variables are generated 
from the predictive distributions of two models. 

We also evaluate how the non-sparse GPSEM-LV behaves 
compared to the sparse alternative. Notice that while Con- 
sumer and Housing have each approximately 300 training 
points in each cross-validation fold, Abalone has over 3000 
points. For the non-sparse GPSEM, we subsampled all of 



also included a fourth latent concept of "Air pollution," which we 
removed due to the absence of one of its indicators in the elec- 
tronic data file that is available. 

9 The final set of indicators, using the nomenclature of the 
UCI repository documentation file, is as follows: "Structural" 
has as indicators RM and AGE; "Neighborhood I" has as in- 
dicators CRIM, ZN and B; "Neighborhood II" has as indica- 
tors INDUS, TAX, PTRATIO and LSTAT; "Accessibility" 
has a s indicators DIS and RAD. See dAsuncion and Newmanl 
2007) for detailed information about these indicators. Following 
Harrison and Rubinfield, we log-transformed some of the vari- 
ables: INDUS, DIS, RAD and TAX. 





Accessibility 



Figure 6: Scatterplots of 2000 fantasy samples taken from 
the predictive distributions of sparse GPSEM-LV models. 
In contrast, GPLVM would generate spherical Gaussians. 



Abalone training folds down to 300 samples. 

Results are presented in Table Q] Each dataset was cho- 
sen to represent a particular type of problem. The data 
in Consumer is highly linear. In particular, it is impor- 
tant to point out that the GPSEM-LV model is able to be- 
have as a standard structural equation model if necessary, 
while the quadratic polynomial model shows some overra- 
ting. The Abalone study is known for having clear func- 
tional relationships amo n g var iables, as also discussed by 
Friedman and Nachman (2000). In this case, there is a 



substantial difference between the non-linear models and 
the linear one, although GPLVM seems suboptimal in this 
scenario where observed variables can be easily clustered 
into groups. Finally, functional r elationships among vari- 
ables in Housing are not as clear dFriedman and Nachmanl 
2000), with multimodal residuals. GPSEM still shows 



an advantage, but all SEMs are suboptimal compared to 
GPLVM. One explanation is that the DAG on which the 
models rely is not adequate. Structure learning might be 
necessary to make the most out of nonparametric SEMs. 

Although results suggest that the sparse model behaved 
better that the non-sparse one (which was true of some 
cases found by Snelson and Ghahramani, 2006, due to het- 
eroscedasticity effects), such results should be interpreted 
with care. Abalone had to be subsampled in the non-sparse 
case. Mixing is harder in the non-sparse model since all 
datapoints {X^\ . . . , X^} are dependent. While we be- 
lieve that with larger sample sizes and denser latent struc- 
tures the non-sparse model should be the best, large sample 
sizes are too expensive to process and, in many SEM appli- 
cations, latent variables have very few parents. 

It is also important to emphasize that the wallclock sam- 
pling time for the non-sparse model was an order of mag- 
nitude larger than the sparse case with M = 50 — even con- 
sidering that 3000 training points were used by the sparse 
model in the Abalone experiment, against 300 points by 
the non-sparse alternative. 





Consumer 


Abalone 


Housing 




GPS GP 


LIN QDR GPL 


GPS GP 


LIN QDR GPL 


GPS GP 


LIN QDR GPL 


Fold 1 
Fold 2 
Fold 3 
Fold 4 
Fold 5 


-20.66 -21.17 
-21.03 -21.15 
-20.86 -20.88 
-20.79 -21.09 
-21.26 -21.76 


-20.67 -21.20 -22.11 
-21.06 -21.08 -22.22 
-20.84 -20.90 -22.33 
-20.78 -20.93 -22.03 
-21.27 -21.75 -22.72 


-1.96 -2.08 
-1.90 -2.97 
-1.91 -5.50 
-1.77 -2.96 
-3.85 -4.56 


-2.75 -2.00 -3.04 
-2.52 -1.92 -3.41 
-2.54 -1.93 -3.65 
-2.30 -1.80 -3.40 
-4.67 -3.84 -4.80 


-13.92 -14.10 
-15.07 -17.70 
-13.66 -15.75 
-13.30 -15.98 
-13.80 -14.46 


-14.46-14.11 -11.94 
-16.20 -15.12 -12.98 
-14.86 -14.69 -12.58 
-14.05 -13.90 -12.84 
-14.67 -13.71 -11.87 



Table 1: Average predictive log-likelihood in a 5-fold cross-validation setup. The five methods are the GPSEM-LV model 
with 50 pseudo-inputs (GPS), GPSEM-LV with standard Gaussian process priors (GP), t he linear and qua dratic structural 
equation models (LIN and QDR) and the Gaussian process latent variable model (GPL) of Lawrence! ( 20051) . a nonparamet- 
ric factor analysis model. For Abalone, GP uses a subsample of the training data. The p-values given by a paired Wilcoxon 
signed-rank test, measuring the significance of positive differences between sparse GPSEM-LV and the quadratic model, 
are 0.03 (for Consumer), 0.34 (Abalone) and 0.09 (Housing). 



6 RELATED WORK 

Non-linear factor analysis has been studied for decades in 
the psychometrics literaturj^l A review is provided by 
Yalcin and Amemiva 120011) . However, most of the clas- 
sic work is based on simple parametric models. A modern 
approach based on Gaussian p rocesses is t he Ga ussian pro- 
cess latent variable model of iLawrencd (120051) . By con- 
struction, factor analysis cannot be used in applications 
where one is interested in learning functions relating la- 
tent variables, such as in causal inference. For embed- 
ding, factor analysis is easier to use and more robust to 
model misspecification than SEM analysis. Conversely, it 
does not benefit from well-specified structures and might 
be harder to interpret. iBollen (1989) discusses the inter- 
play between factor analysis and SEM. Practical non-l inear 
structural equation models are discussed by lLed (12007 ). but 
none of such approaches rely on nonparametric methods. 
Gaussian processes latent structures appear most l y in th e 
context of dynamical systems (e.g., iKo and Foxl (12009^ 1 . 
However, the connection is typically among data points 
only, not among variables within a data point, where on- 
line filtering is the target application. 

7 CONCLUSION 

The goal of graphical modeling is to exploit the structure 
of real-world problems, but the latent structure is often ig- 
nored. We introduced a new nonparametric approach for 
SEMs by extending a sparse Gaussian process prior as a 
fully Bayesian procedure. Although a standard MCMC 
algorithm worked reasonably well, it is possible as future 
work to study ways of improving mixing times. This can 
be particularly relevant in extensions to ordinal variables, 
where the sampling of thresholds will likely make mixing 
more difficult. Since the bottleneck of the procedure is the 



Another instance of the "whatever you do, some- 
body in psychometrics already did it long before" law: 
http://www.stat.columbia.edu/~cook/movabletype/archives/ 
2009/0 l/a_longstanding.html 



sampling of the pseudo-inputs, one might consider a hy- 
brid approach where a subset of the pseudo-inputs is fixed 
and determined prior to sampling using a cheap heuris- 
tic. New ways of deciding pseudo-input locations based 
on a given measurement model will be required. Evalua- 
tion with larger datasets (at least a few hundred variables) 
remains an open problem. Finally, finding ways of deter- 
mining the graphical structure is also a promising area of 
research. 
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APPENDIX A: FURTHER MCMC 
DETAILS 



number of pseudo-inputs per latent function fi(-), N be the sam- 
ple size, V the number of latent variables and K the common 
number of Gaussian mixture components for each exogenous la- 
tent variable. 

The sampler is a standard Metropolis-Hastings procedure with 
block sampling: random variables are divided into blocks, where 
we sample each block conditioning on the current values of the 
remaining blocks. 

We consider both the non-sparse and sparse variations of GPSEM. 
The blocks are as follows for the non-sparse GPSEM: 

• the linear coefficients for the structural equation of each ob- 
served variable Y,: {Ajo} U Aj 

• the conditional variance for the structural equation of each 
observed variable Yj : v tj 

• the d-th instantiation of each latent variable Xi, x[ ; 

• the set of latent function values {f\ , . . . , f\ } for each 
particular endogenous latent variable Xi 

• the conditional variance for the structural equation of each 
latent variable Xi : 

• the set of latent mixture component indicators 

, ---Zl } for each particular exogenous latent 
variable X% 

• the set of means {fin, . . . , [iiK } for the mixture compo- 
nents of each particular exogenous latent variable Xi 

• the set of variances {vn, . . . , vm} for the mixture compo- 
nents of each particular exogenous latent variable Xi 

• mixture distribution ni corresponding to the probability over 
mixture components for exogenous latent variable Xi 

The blocks for the sparse model are similar, except that 

• all instantiations of a given latent variable xf\ for d = 
1,2, ... ,N, are mutually independent conditioned on the 
functions, pseudo-inputs and pseudo-functions. As such, 
they can be treated as a single block of size N, where all 
elements are sampled in parallel; 

• the d-th instantiation of each pseudo-input for d = 
1,2,..., M 

• all instantiations of latent functions and pseudo-latents func- 
tions {/f , . . . , / i (JV) , /p' , . . . , /> M) } for any particular 
Xi are conditionally multivariate Gaussian and can be sam- 
pled together 

We adopt the convention that, for any particular step described in 
the following procedure, any random variable that is not explic- 
itly mentioned should be considered fixed at the current sampled 
value. Moreover, any density function that depends on such im- 
plicit variables uses the respective implicit values. 



We use a MCMC sampler to draw all variables of interest from 
the posterior distribution of a GPSEM model. Let A/ denote the 



Our implementation uses c ode for subma trix Cholesky updates 
from the library provided by Seeger (2004). 



The measurement model 

The measurement model can be integrated out in principle, if we 
adopt a conjugate normal-inverse gamma prior for the linear re- 
gression of observed variables Y on X. However, we opted for 
a non-conjugate prior in order to evaluate the convergence of the 
sampler when this marginalization cannot be done (as in alterna- 
tive models with non-Gaussian error terms). 

Given the latent variables, the corresponding conditional distribu- 
tions for the measurement model parameters boil down to stan- 
dard Bayesian linear regression posteriors. In our Metropolis- 
Hastings scheme, our proposals correspond to such conditionals, 
as in Gibbs sampling (and therefore have an acceptance probabil- 
ity of 1). 

Let Xpj be the parents of observed variable Yj in the graph and 
let the d-fh instantiation of the corresponding regression input 
be Xp' = [xp.l] T . Let each cofficient \ju have an indepen- 
dent Gaussian prior with mean zero and variance u. Conditioned 
on the error variance v (j , the posterior distribution of the vector 
[Aji, . . . , Ajixp. | , Ajo] T is multivariate Gaussian with covariance 



s j = (Ed=i xpjZpj + l l u ) 1 and mean s > Ed=i x , 

where I is a (|Xp. + 1) x (|Xp, | + 1) identity matrix. 

The derivation for the case where some coefficients Xjk are fixed 
to constants is analogous. 

For a fixed set of linear coefficients {Ajo} U Aj, we now sam- 
ple the conditional variance v tj . Let this variance have a in- 
verse gamma prior (a,b). Its conditional distribution is an inverse 
gamma (a', 6'), where a' = a + N/2, b' = b + £d=i(ei d) ) 2 /2, 



and e 



(d) 



(d) 

y) 



A 



jo 



The structural model: non-sparse GPSEM 

For all i = 1,2, ... ,V and d = 1, 2, . . . , N, we propose each 

new latent variable value x^ individually, and accept or reject it 
based on a Gaussian random walk proposal centered at the current 
value xf"^ . We accept the move with probability 



min < 1 



9xM d) ') 
9xM d) ) 

where, if Xi is not an exogenous variable in the graph, 

g x M d) ) = e-l'^-W'^ 

x nx c6 x Ci P(/i d) |/i V) ) (4) 

Recall that fi(-) is a function of the parents Xp, of Xi in the 
graph. The d-fh instantiation of such parents assume the value 



Xp*'. We use f^ a ' as a shorthand notation for /j(xp'). Morever, 
let Xc^ denote the latent children of Xi in the graph. The symbol 
/j^ d ' refers to the respective function values taken by f c in data 
points {1, 2, . . . , d — 1, d + 1, • - - , N}. Function p(/ c (d) | / c (Xd) ) 
is the conditional density of /i given fc , according to the 
Gaussian process prior. The evaluation of this factor costs Q (N 2 ) 
using standard submatrix Cholesky updates (Seeger, 2004). As 
such, sampling all latent values for Xi takes 0(N 3 ). 

Finally, Xy ( denotes the observed children of Xi, and function 



Id), 



evaluated at yi d \ given its parents (which includes -X^ d ') and 
(implicit) measurement model parameters. This factor can be 
dropped if y c is missing. 

If variable Xi is an exogenous variable, then the factor 



-(x\ 



' gets substituted by 



where z' d ' is the latent mixture indicator for the marginal mix- 
ture of Gaussians model for Xi, with means {fin, . . . , [im} and 
variances {vn, v iK }. 

Given all latent variables, latent function values {/} , • • • , f\ } 
are multivariate Gaussian with covariance matrix 



S/^CK^+I/^.)- 1 

where Ki is the corresponding kernel matrix and I is a N x N 
identity matrix. The respective mean is given by S/jXj /u{ 4 , 



where x 



(UN) 



JNhT 



p(j/c d ' | Xp') is the corresponding density of observed child Y c 



Ad) 



This operation costs 0(N ). 
We sample from this conditional as in a standard Gibbs update. 

Sampling each latent conditional variance can also be done 
by sampling from its conditional. Let v,; i have an inverse gamma 
prior (a^, b(). The conditional distribution for this variance given 
all other random variables is inverse gamma (a'r , b'/), where a'^ = 

a c + N/2 and b' Q = b c + Et^f - f^f 'A 

We are left with sampling the mixture model parameters that cor- 
respond to the marginal distributions of the exogenous latent vari- 
ables. Once we condition on the latent variables, this is com- 
pletely standard. If each mixture mean parameter ^ is given an 
independent Gaussian prior with mean zero and variance tv, its 
conditional given the remaining variables is also Gaussian with 
variance v'^ = l/(l/v n + \Zij\/vij), where Zij is the subset of 
1,2, . . . , N such that d G Zij if and only if z' d ' = j. The corre- 
sponding mean is given by v' n Edgz x i l v ij- If eacn mixture 
variance parameter Vij is given an inverse gamma prior (a T , b n ), 
its conditional is an inverse gamma (a^, b'^), where = a w + 

\Zij\/2, and b' n = K + J2 deZlJ ( x< i d) ~ ^j) 2 /2- The condi- 
tional probability P(zj = j | everything else) is proportional 
t0 £t1i%- 1/ V (a 4 t) " M,3 ' 2/(2 ^ ) . Finally, given a Dirichlet 
prior distribution (ai , . . . , cxk) for each m, its conditional is also 
Dirichlet with parameter vector (aj + \Zn\, . . . ,olk + \Znc\). 



APPENDIX B: A NOTE ON 
DIRECTIONALITY DETECTION 

The assumption of linearity of the measurement model is not only 
a matter of convenience. In SEM applications, observed variables 
are carefully chosen to represent different aspects of latent con- 
cepts of interest and often have a single latent parent. As such, it 
is plausible that children of a particular latent variable are differ- 
ent noisy linear transformations of the target latent variable. This 
differs from other applications of l atent variable Gau ssian process 
models such as those introduced bv lLawrencd j2005) . where mea- 
surements are not designed to explicitly account for target latent 
variables of interest. Moreover, this linearity condition has im- 
portant implications on distinguishing among candidate models. 



Implications for Model Selection 

We assumed that the DAG Q is given. A detailed discussion of 
model selection is left as future work. Instead, we discuss some 
theoretical aspects of a very particular but important structural 
feature that will serve as a building bl ock to more genera l model 
selection procedures, in the spirit of iHover et alj i2008al) : deter- 
mining sufficient conditions for the subproblem of detecting edge 
directionality from the data. Given a measurement model for two 
latent variables Xi and X%, we need to establish conditions in 
which we can test whether the only correct latent structure is 
X\ — s> X2, X2 — > Xi, th e disconnect e d struc ture, or either di- 
rectionality. The results of Hoyer et al. ( 2008a) can be extended 
to the latent variable case by exploiting the conditions of identifi- 
ability discussed in Section l2~T1 as follows. 

Our su fficient conditions are a weaker set of assumptions than 
that of ISilva et all d2006h . We assume that X 1 has at least two 
observable children which are not children of X2 and vice- versa. 
Call these sets {Yi, Y/} and {Y2, Y 2 '}, respectively. Assume all 
error terms ({ei} U {Ci}) are non-Gaussia n^. The variance of a ll 
error terms is assumed to be nonzero. As in Ho ver et al.1 {2008a), 
we also assume Xi and X2 are unconfounded. 

To test whether the model where X\ and X2 are independent be- 
comes easy in this case: the independence model entails that (say) 
Y\ and Y2 are marginally independent. This can b e tested using 
the no nparametric marginal independence test of Gretton et al. 
120071) . 

For the nontrivial case where latent variables are dependent, 
the results of Section |2~71 imply that the measurement model of 
{X\ — > Yi, X\ —¥ Y{} is identifiable up to the scale and sign of 
the latent variables, including the marginal distributions of e\ and 
Ei' . An analogous result applies to {X2, Y2, Y 2 '}. 

Since the measurement model {Yi, Y{, Y2, Y%} of {Xi,X2} is 
identifiable, assume without loss of generality that the linear co- 
efficients corresponding to Xi — > Y\ and X2 —¥ Y2 are fixed to 
1, i.e., Yi = Xi +€i and Y2 = X2 + £2- Also from Section [2~Tl it 
follows that the distribution of {Xi, X2} ca n be identified under 
very general conditions. The main result of IHover et al. (2008a) 
can then be directly applied. That is, data generated by a model 
X2 ~ f(Xi) + r/2, with r\2 being non-Gaussian and independent 
of Xi, cannot be represented by an analogous generative model 
X\ — g(X2) + r/i except in some particular cases that are ruled 
out as implausible. 

Practical Testing 

The test fo r comp aring Xi X2 against X2 Xi in 
IHover et al.1 d2008ij) can be modified to our context as follows: 
we cannot regress X2 on X\ and estimate the residuals £2 since 
Xi and X2 are latent. However, we can do a error-in-variables 
regression of Y2 on Y\ using Y{ and Y 2 ' as instrumental variables 
(Carroll et ail l2004l) : this means we find a function h(-) such that 
Y2 = h(X) + r and Y\ — X + w, for non-Gaussian latent vari- 
ables r, w and X. We then calculate the estimated residuals r of 
this regression, and tes t whether such residuals are independent of 
Yi jGretton et all [2007). If this is true, then we have no evidence 
to discard the hypothesis Xi — > X2 ■ 

The justification for this process is that, if the true model is indeed 
X2 — f(Xi) + r]2, then h(-) = /(•) and r = £2 + 772 in the limit 



"Variations where ei and latent error terms are allowed to be 
Gaussian, as in our original model description are also possible 
and will be treated in the future. 



of infinite data, since t he error-in-variable s regression model is 
identifiable in our case iCarroll et alll2004l) . with X = Xi being 
a consequence of deconvolving Yl and ei . By this result, r will 
be independent of Y\. However, i f the opposite holds (X\ 
X2) then, as in (Hov er etaU l2008a). the residual is not in general 
independent of Yl : give n X\ (= X), there is a d-connecting path 
Y 2 X2 -¥ Xi <- t]i jPearlH2000T) . and r will be a function of 
971, wh ich is dependent on Yi. This is analogous to (Hover et al., 
2008a), but using a different family of regression techniques. 

Error-in-variables regression is a special case of the Gaussian 
process SEM. The main practical difficulty on using GPSEM 
with the pseudo-inputs approximation in this case is that such 
pseudo -inputs formulation impli e s a heteroscedastic regression 
model dSnelson and Ghahramani, 2006). One has either to use 
the GPSEM formulation without pseudo-inputs, or a model linear 
in the parameters but with an explicit, finite, basis dictionary on 
the input space. 



